Intermittent density fluctuations of nucleotide molecules (adenine, guanine, cytosine and thymine)
along DNA sequences are studied in the framework of a hierarchical structure (HS) model originally
proposed for the study of fully developed turbulence [She and Leveque, Phys. Rev. Lett. 72,
336 (1994)]. Large scale ($10^3 < \ell < 10^5$ bp) base density fluctuation is shown to satisfy the HS
similarity. The derived values of a HS parameter $\beta$ from a large number of genome data (including
Bacteria, Archaea, human chromosomes and viruses) characterize different biological properties such 
as strand symmetry, phylogenetic relations and horizontal gene transfer. It is suggested that the 
HS analysis offers a useful quantitative description for heterogeneity, sequence complexity and large 
scale structures of genomes. 

PACS numbers: 87.14.Gg; 87.15.Aa; 87.15.Cc



I. INTRODUCTION 

The DNA sequence of a complete genome of an organism contains
the information not only for making all the proteins (genes)
necessary for the organism, but also for assembling them to form
the organism in a specific time order with specific
three-dimensional patterns. While small-scale (from several to
hundreds of base pairs) patterns of the nucleotide arrangement
are certainly important for determining its coding or non-coding
nature and some regulatory biological functions (e.g., binding-site
or splicing-site signals) [1], larger-scale variation across
several thousand base pairs or longer may be related to
higher-level biological functions, such as controlling networks
of genes, which are likely important indices in evolution [?].
It is important to develop tools for analyzing these patterns
with the available sequence data, and to use the data as a
laboratory for quantitatively exploring biological laws such as
the mechanism of biological evolution [?].

There have been considerable efforts in studying the statistical
properties of nucleotide distribution patterns [?]. The concept
of "domains-within-domains" was introduced in Ref. [?] and
confirmed in Ref. [?]. Algorithms for DNA sequence alignment and
similarity search have been developed for the study of the
phylogeny and evolution of many biological species [?]. Other
methods developed in nonlinear analysis and information theory
were introduced to characterize coding and non-coding DNA
sequences [8]. Besides these studies focused on local motifs of
the DNA sequence, many other methods, including statistical
physics analysis, spectrum analysis [?], wavelet analysis [?],
etc., have also been proposed to measure the correlation between
nucleotides over long distances along the one-dimensional DNA
chain.

Although long-range correlation in DNA sequences has been
established [13], the nature and the significance of this
correlative property remain under debate. Of the previously
reported scaling behaviors of DNA sequences, the most famous is
the $1/f$-like power law at moderate length scales (typically 10
to 1000 bp) [?]. Less effort has been devoted to examining
larger-scale correlations, partially due to the lack of very long
sequences in past decades. Recently, the large-scale structure of
DNA sequences, especially of complete genomes, has been studied
[16]. Such large-scale structure at the genomic level contains
global evolutionary information that is lacking at small scales.
Traditional methods, however, are ineffective for analyzing
long-range correlation structure at the whole-genome level. For
example, it is difficult to draw biological meaning from local
"base-base" correlations [13], and power spectrum analysis is
impractical due to computational limitations [16]. The present
study gives a different approach which may be applicable to such
large-scale correlation at the genome level.

We have briefly introduced a hierarchical structure (HS)
description of multiple-scale structures of DNA sequences [17],
based on an earlier HS model for hydrodynamic turbulence [18].
The starting point of the analysis is to construct a
visualization of the nucleotide density fluctuation and its
probability density function (PDF) description, then to perform a
multiple-moment scaling analysis, and finally to apply the HS
scaling model to characterize the fluctuation structure. This
methodology makes it possible to study correlation structures up
to more than $10^5$ bp. Our analysis reveals that the nucleotide
composition variations along genomes are far from random, but
present a complex self-organized structure, called intermittent
structure, which can be captured by the HS analysis. In this
work, we will show that a detailed study of the systematic
variation of the HS parameter $\beta$ measured from more than one
hundred sequences of four kingdoms (Bacteria, Archaea, human
chromosomes and viruses) reveals significant biological
information, such as strand symmetry, phylogenetic relations and
horizontal gene transfer.

FIG. 1: Nucleotide guanine density (G) variation $f_\ell$ of (a)
Random, (b) Simulation, (c) Ecoli and (d) Hsap4. Local densities
are calculated over a window of $\ell_{\min} = 2^{10}$ bp (left)
and $\ell_{\max} = 2^{17}$ bp (right), respectively. The sliding
window moves at a step of length $\Delta = 1024$ bp. Note the
intense fluctuation of the natural sequences compared with the
artificial ones.

FIG. 2: Typical PDFs of guanine density (G) $f_\ell$ of (a)
Random, (b) Simulation, (c) Ecoli and (d) Hsap4 at two scales,
$\ell_{\min} = 2^{10}$ (circles) and $\ell_{\max} = 2^{17}$
(triangles). Note that with decreasing $\ell$, the right wings of
the PDFs extend further, indicating the appearance of high
intensity fluctuations in $f_\ell$ which can be captured by the
HS analysis.

The paper is organized as follows. A multi-scale variable
$f_\ell$ describing the base composition fluctuations of genome
sequences is introduced in Sec. II. We briefly present
measurements of scaling properties, with special emphasis on the
HS model and the similarity test ($\beta$-test), in Sec. III.
Section IV is devoted to a detailed HS analysis of various kinds
of genome data. Section V offers a summary and some additional
discussion.



II. BASE COMPOSITION FLUCTUATIONS 

A single-stranded DNA chain can be viewed as a symbolic series
$\{n_i\}$ ($i = 1, 2, \ldots, L$) of length $L$ composed of the
four nucleotides A, C, G and T. There are many kinds of
transformations of DNA sequences designed to capture certain
properties, such as the "DNA walk" [?], which constructs a
numerical sequence $\{u_i\}$ by a certain mapping rule (e.g., the
adenine rule: if $n_i = \mathrm{A}$ then $u_i = 1$; in all other
cases $u_i = 0$). Then a running sum
$y(n) = \sum_{i=1}^{n} [u_i - (1 - u_i)]$ can be presented
graphically as a one-dimensional landscape of the original DNA
sequence. Here we employ an alternative approach that introduces
a window of length $\ell$ bp on the DNA sequence, and define the
(local) density of a particular base as

$$f_\ell(i) = \frac{1}{\ell} \sum_{k=i}^{i+\ell-1} u_k, \tag{1}$$

where $i$ is the location of the first base within the window.
The definition can be applied to any single nucleotide (A, C, G,
or T) or their degenerate classes (R, Y, etc.) and to any
dinucleotide (AT, AG, etc.). By sliding the window with a certain
moving step $\Delta$ and changing the window size $\ell$ along
the DNA sequence, we can obtain different fluctuation sequences
$f_\ell$. When $\ell = L$, $f_\ell$ becomes the mean content of
the nucleotide in the entire DNA chain. This multi-scale variable
$f_\ell$, similar to the locally averaged energy dissipation rate
$\epsilon_\ell$ in turbulence, is a coarse graining of the base
density of the original DNA sequence, which allows us to
calculate the probability density function (PDF) $P(f_\ell)$ and
other quantities of interest.
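
As a rough illustration of the local density defined above, the sliding-window computation can be sketched in Python as follows. The function name `local_density`, the 0-based window positions and the `circular` switch are assumptions made for this sketch; it is not the authors' implementation.

```python
def local_density(seq, base, window, step, circular=True):
    """Fraction of `base` in each window of length `window`.

    Windows start at positions 0, step, 2*step, ... (0-based, whereas
    the text counts positions from 1). With circular=True the sequence
    is treated as a ring, so every position can start a window.
    """
    n = len(seq)
    if not 0 < window <= n:
        raise ValueError("window must satisfy 0 < window <= len(seq)")
    # Indicator sequence u_k: 1 where the base matches, 0 elsewhere.
    u = [1 if ch == base else 0 for ch in seq]
    # Append the first window-1 entries so that windows can wrap around.
    ext = u + u[:window - 1] if circular else u
    # Prefix sums give each window count in O(1).
    prefix = [0]
    for x in ext:
        prefix.append(prefix[-1] + x)
    last_start = n - 1 if circular else n - window
    return [(prefix[i + window] - prefix[i]) / window
            for i in range(0, last_start + 1, step)]
```

For example, `local_density("GATTACAG", "G", 8, 8)` returns the mean guanine content of the whole toy sequence, which mirrors the statement that the density reduces to the mean content when the window spans the full chain.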

The fluctuation structures of DNA sequences can be shown by a
plot of the local base density $f_\ell$ against the sequence
position. Figure 1 displays a segment of 0.8 million bp of
guanine density (G) fluctuations $f_\ell$ at two scales,
$2^{10}$ ($\approx 10^3$) bp and $2^{17}$ ($\approx 10^5$) bp,
for four sequences: an independent identically distributed
(i.i.d.) random sequence of ten million bp with 50% A+T content
(Random), a simulated genome sequence of one million bp generated
by the minimal model (Simulation) [20], the E. coli whole genome
(Ecoli) and H. sapiens chromosome 4 contig 8 (Hsap4). The random
sequence, unsurprisingly, shows a white-noise signal at both
scales and has the smallest fluctuation amplitude. The simulated
sequence contains some visible spikes but is on the whole
stationary. The E. coli genome contains many low guanine density
regions which are atypical of the main body. The Hsap4 sequence,
with its especially high guanine density, seems the most
intermittent: it includes many strong bursts standing out against
the background and has the highest fluctuation amplitude among
the four. We believe that the transition of the fluctuation from
small scales to large ones is of special interest for revealing
global information about the genome.

With the multi-scale variable $f_\ell$, we can carry out the
multi-scale PDF method to characterize the interesting structures
of such sequences. First, we give an analytical discussion of the
i.i.d. random control sequence, in which the probability $p$ that
a base has $u_i = 1$ (guanine rule) is equal to 0.25. For a given
window size $\ell$, the number of guanines in the window,
$\sum_k u_k$, exactly obeys the binomial distribution
$B(\ell, p)$ [21]. Thus the PDF of the density $f_\ell$ has a
binomial shape, which is asymmetric when the number of trials is
small and becomes approximately symmetric when the number of
trials is large enough. Furthermore, if the "success" probability
$p$ of each trial is fixed between 0 and 1 (here $p = 0.25$), the
binomial distribution approaches a Gaussian distribution. So for
larger window sizes, the PDFs of $f_\ell$ have a Gaussian shape.
We find that at small scales (less than several hundred bp) the
right wing of the PDF of $f_\ell$ extends further than the left
wing. When the scale $\ell \approx 10^3$ bp, the shape of the
PDFs becomes nearly symmetric and Gaussian-like (data not shown
here), which indicates the vanishing of window size effects. We
hereafter analyze natural DNA sequences beyond this scale.
Careful calculation of the PDFs makes it possible to investigate
the fluctuation structures at very large scales up to $10^5$ bp,
about 1/2 to 1/100 of a typical microbial genome. For eukaryotic
genomes such as the human chromosomes, larger scales are more
practical to study. For comparison, however, the scale range is
fixed between $10^3$ and $10^5$ bp for all sequences studied
below.
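
The window size effect described above can be quantified with the standard skewness of a binomial distribution, $(1 - 2p)/\sqrt{\ell p (1 - p)}$, which is also the skewness of the density since dividing by the window length does not change it. The short Python sketch below evaluates it; the function name `binomial_density_skewness` is hypothetical and chosen only for this illustration.

```python
import math

def binomial_density_skewness(window, p=0.25):
    """Skewness of the local density in an i.i.d. sequence.

    The base count in a window follows B(window, p); its skewness,
    (1 - 2p) / sqrt(window * p * (1 - p)), is unchanged when the
    count is divided by the window length.
    """
    return (1 - 2 * p) / math.sqrt(window * p * (1 - p))

# The asymmetry is positive for p = 0.25 (longer right wing) and
# shrinks like 1/sqrt(window) as the window grows.
for window in (64, 256, 1024):
    print(window, binomial_density_skewness(window))
```

The positive sign for $p = 0.25$ agrees with the longer right wing reported for small windows, and the decay with window size agrees with the nearly symmetric shape near $10^3$ bp.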

The evolution of the PDFs of guanine density fluctuation within
the scale range $\ell = 2^{10} \sim 2^{17}$ bp for the full
length of the four sequences in Fig. 1 is shown in Fig. 2, where
only the two scales $\ell_{\min} = 2^{10}$ and
$\ell_{\max} = 2^{17}$ are displayed. With decreasing scale
$\ell$, the tails of the distribution extend further, indicating
the emergence of highly intense fluctuations in $f_\ell$. The
four sets of PDFs show distinct shapes. Both the random and the
simulated sequences have narrow PDFs, corresponding to their low
fluctuation magnitudes. Moreover, the PDFs of the simulated
sequence are symmetric, like those of the random sequence. The
other two (Ecoli and Hsap4) are not symmetric. The right wings of
the PDFs of Hsap4 are far higher than the left, with an
exponentially decaying tail, and those of Ecoli show just the
opposite. Such tendencies in the changes of the PDFs are well
captured by our quantitative HS analysis below.

FIG. 3: SS and ESS plots with $S_3(\ell)$ vs. $\ell$ and
$S_3(\ell)$ vs. $S_2(\ell)$, respectively, for (a) Random, (b)
Simulation, (c) Ecoli and (d) Hsap4. The scale range is
$\ell = 2^{10} \sim 2^{17}$. Note that the curvature of the SS
plots means no absolute scaling, whereas the linearity of the ESS
plots validates the relative scaling property. The ESS scaling
exponents are measured by a least-squares fit.

III. MEASUREMENTS 

A. Extended self-similarity analysis 

Denote by $S_p(\ell)$ the $p$th order moment of the fluctuation
$f_\ell$:

$$S_p(\ell) = \langle f_\ell^p \rangle = \int f_\ell^p P(f_\ell)\, df_\ell, \tag{2}$$

where $P(f_\ell)$ is the PDF of $f_\ell$. For the calculation of
the PDF, we treat the linear sequence as a circle, so that all
bases in the sequence are treated equally (especially for large
$\ell$). In fact, most prokaryotic genomes are indeed circular.
For large linear eukaryotic chromosomes like those of H. sapiens
($L \gg \ell$), the choice of open or closed boundary conditions
gives essentially the same result.
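
In practice the integral in Eq. (2) can be estimated by a plain sample average over all window positions, which is equivalent to integrating against the empirical PDF. The Python sketch below shows this; the function name `moment` is an assumption of this sketch.

```python
def moment(densities, p):
    """Estimate the p-th order moment of the local density.

    `densities` holds one local density value per window position at
    a fixed window size; the sample mean of f**p approximates the
    integral of f**p against the PDF.
    """
    return sum(f ** p for f in densities) / len(densities)

# Order 0 is always 1 and order 1 is the mean density, at any scale.
sample = [0.2, 0.4]
print(moment(sample, 0), moment(sample, 1), moment(sample, 2))
```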

In previous studies, great efforts were made to explore the
power-law scaling properties of quantities like $S_p(\ell)$ with
respect to the length scale $\ell$:
$S_p(\ell) \sim \ell^{\zeta_p}$, called self-similarity (SS),
where the $\zeta_p$ are called the scaling exponents [13].
Consequently, a log-log plot of $S_p(\ell)$ versus $\ell$ gives a
straight line with slope $\zeta_p$. The DFA method [22] for
analyzing the "DNA walk" is such an SS approach. In many cases,
however, such absolute scaling does not hold well. Therefore, a
more general scaling relation called extended self-similarity
(ESS) has been introduced in the field of turbulence, which is
widely valid even when the SS property is not available. The
existence of ESS implies that the moments of different orders
share a common mode of variation with respect to the length
scale; ESS is thus also called relative scaling, and has the form

$$S_p(\ell) \sim S_2(\ell)^{\zeta_{p,2}}, \tag{3}$$

where $\zeta_{p,2}$ is called the relative scaling exponent.
Log-log plots of $S_3(\ell)$ vs. $\ell$ for guanine, with $\ell$
ranging from $10^3$ to $10^5$ bp, are shown in Fig. 3(a), where
the bent curves indicate the lack of power-law scaling for all
four sequences. Figure 3(b) displays plots of $S_3(\ell)$ vs.
$S_2(\ell)$, where the perfect linearity verifies the existence
of the ESS property in the same scale range. Careful examination
of other $S_p(\ell)$ with higher order $p$ (up to $p = 8$) also
validates ESS (data not shown here). The relationship between the
scaling exponents $\zeta_{p,2}$ of different orders $p$ can be
precisely predicted by the HS model, as described below.
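
A relative scaling exponent of this kind can be estimated as the least-squares slope of a log-log plot of one moment against the second-order moment across scales. The following Python sketch assumes the moments have already been computed at each scale; the names `loglog_slope` and `ess_exponent` are hypothetical.

```python
import math

def loglog_slope(xs, ys):
    """Least-squares slope of log10(ys) against log10(xs)."""
    lx = [math.log10(x) for x in xs]
    ly = [math.log10(y) for y in ys]
    n = len(lx)
    mx, my = sum(lx) / n, sum(ly) / n
    sxx = sum((a - mx) ** 2 for a in lx)
    sxy = sum((a - mx) * (b - my) for a, b in zip(lx, ly))
    return sxy / sxx

def ess_exponent(s_p, s_2):
    """Relative exponent: slope of log S_p versus log S_2 over scales.

    s_p and s_2 are lists of moments, one entry per window size.
    """
    return loglog_slope(s_2, s_p)
```

Plotting the moments against the scale itself with the same `loglog_slope` helper gives the ordinary (SS) exponent, so the two kinds of plot in Fig. 3 differ only in the choice of abscissa.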



B. Hierarchical structure analysis 

The HS model was originally proposed by She and Leveque [18] to
describe inertial-range multi-scale fluctuations in terms of a
similarity relation between structures of increasing intensities
of successive moment orders $p$ in hydrodynamic turbulence. This
new similarity relation, a generalization of Kolmogorov's
complete-scale similarity, was later developed into an HS theory
[23], which has been successfully applied to many
turbulence-related fields, such as Couette-Taylor flow [24],
flows in a rapidly rotating disk [25], climate turbulence [26]
and astrophysical magnetohydrodynamic turbulence [27], and to
various other complex systems, such as diffusion-limited
aggregates [28], the luminosity fields of natural images [29] and
chemical reaction patterns [30]. A preliminary analysis of the
base density fluctuations at moderate length scales along
microbial genomes [31] has also given an encouraging sign that
leads to the present work.

The HS model introduces a hierarchy of functions for successive
fluctuation intensities:

$$\mu_p(\ell) = \frac{S_{p+1}(\ell)}{S_p(\ell)} = \frac{\int f_\ell^{p+1} P(f_\ell)\, df_\ell}{\int f_\ell^{p} P(f_\ell)\, df_\ell} = \int f_\ell\, Q_p(f_\ell)\, df_\ell, \tag{4}$$

where $Q_p(f_\ell) = f_\ell^p P(f_\ell) / \int f_\ell^p P(f_\ell)\, df_\ell$
is a weighted PDF for which $\mu_p(\ell)$ is the mathematical
expectation. Such a hierarchy $\mu_p(\ell)$ covers the mean
density fluctuation intensity $\mu_0$ and a series of increasing
hierarchical intensities with increasing order $p$, and finally
approaches the intensity of the so-called most intermittent
structure, $\mu_\infty(\ell) = \lim_{p \to \infty} \mu_p(\ell)$.
Therefore, one can associate each intensity with an appropriate
order $p$, which varies continuously from 0 to infinity. The
increasing hierarchical intensity reflects the increasing
contribution of positive fluctuation events and the decreasing
contribution of negative ones. When $p$ is small (less than 10),
the hierarchical function $\mu_p$ is dominated by the competition
between the negative and positive components of strong
fluctuations (which is reflected in the shape of the PDFs).

Note that $\mu_p(\ell)$ is a function of both $\ell$ and $p$,
which is an inherent merit of the HS model: both scales and
intensities are involved in describing the multi-scaling property
of fluctuation structures. The HS model postulates a relation
among the various intensities, called HS similarity, of the form

$$\frac{\mu_{p+1}(\ell)}{\mu_1(\ell)} = a_p \left( \frac{\mu_p(\ell)}{\mu_0(\ell)} \right)^{\beta}, \tag{5}$$

where the exponent $\beta$ is a constant and $a_p$ is independent
of $\ell$. The validity of the HS similarity relation Eq. (5) can
be tested by the so-called $\beta$-test [23, 33], in which a
log-log plot of $\mu_{p+1}(\ell)/\mu_1(\ell)$ vs.
$\mu_p(\ell)/\mu_0(\ell)$ (often both quantities are normalized
by their values at the smallest scale $\ell_0$) is constructed;
the HS similarity is satisfied as long as linearity is observed.
The HS parameter $\beta$ can then be obtained by measuring the
slope with a least-squares fit. Technically speaking, this
completes the HS analysis of a given set of fluctuation data.
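
The test just described can be sketched in Python as below. The sketch assumes that the hierarchy is the ratio of successive moments, $\mu_p = S_{p+1}/S_p$, and that the HS similarity reads $\mu_{p+1}(\ell)/\mu_1(\ell) \propto [\mu_p(\ell)/\mu_0(\ell)]^{\beta}$, with both ratios normalized by their values at the smallest scale. The function names and the layout of the `moments` table are assumptions of this sketch, not the authors' code.

```python
import math

def hierarchy(moments):
    """Ratios of successive moments, mu[p][j] = S_{p+1} / S_p.

    moments[p][j] holds the p-th order moment at scale index j, with
    j = 0 the smallest scale and p running from 0 upward.
    """
    return [[hi / lo for hi, lo in zip(moments[p + 1], moments[p])]
            for p in range(len(moments) - 1)]

def beta_test(moments):
    """Least-squares slope of the log-log similarity plot."""
    mu = hierarchy(moments)
    xs, ys = [], []
    # p = 0 is skipped because both ratios are identically 1 there.
    for p in range(1, len(mu) - 1):
        for j in range(1, len(mu[p])):
            # Normalize each ratio by its value at the smallest scale.
            x = (mu[p][j] / mu[0][j]) / (mu[p][0] / mu[0][0])
            y = (mu[p + 1][j] / mu[1][j]) / (mu[p + 1][0] / mu[1][0])
            xs.append(math.log10(x))
            ys.append(math.log10(y))
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / sxx
```

The table needs moments of at least orders 0 to 3 and at least three scales for the fit to be meaningful. In a real analysis one would also inspect the correlation coefficient of the fit before trusting the slope.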

The HS theory leads to a scaling equation that predicts the ESS
scaling exponents. Equation (5) leads to a general formula for
the scaling exponents:

$$\zeta_{p,2} = \gamma p + C (1 - \beta^p), \tag{6}$$

where $C = (1 - 2\gamma)/(1 - \beta^2)$ is determined by
$\zeta_{2,2} = 1$. The parameter $\gamma$ is introduced to
characterize the most intermittent structure:
$\mu_\infty(\ell) \sim S_2(\ell)^{\gamma}$. Note that
$S_0(\ell) = 1$ and $S_1(\ell) = C_0$, where $C_0$ is the average
density, and both constants are independent of the scale $\ell$
and of $S_2(\ell)$; thus we have the exact results
$\zeta_{0,2} = 0$ and $\zeta_{1,2} = 0$. The first constraint is
automatically satisfied by Eq. (6), but the second introduces a
relation between the parameters $\beta$ and $\gamma$, namely
$\gamma = 1/(1 - \beta)$, so that only one of them is
independent. Therefore, the HS model here leaves only one free
parameter to describe the multi-scaling exponents of the
nucleotide density fluctuations. An analysis shows:

$$\zeta_{p,2} = \frac{(1 - \beta) p - (1 - \beta^p)}{(1 - \beta)^2}, \tag{7}$$

where $\beta \neq 1$. Note that the limit $\beta \to 1$ means no
intermittency, because $\gamma$ then tends to infinity. When
$\beta = 1$ the (relative) scaling exponents take the quadratic
form $\zeta_{p,2} = p(p - 1)/2$, which is exactly what is
observed in a completely random DNA sequence.
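
For reference, the one-parameter prediction can be evaluated numerically. The Python sketch below assumes the closed form $\zeta_{p,2} = [(1 - \beta) p - (1 - \beta^p)]/(1 - \beta)^2$, which follows from Eq. (6) together with the constraints $\zeta_{1,2} = 0$ and $\zeta_{2,2} = 1$ and reduces to $p(p - 1)/2$ as $\beta \to 1$. The function name `hs_exponent` and the tolerance `tol` are choices made for this sketch.

```python
def hs_exponent(p, beta, tol=1e-6):
    """Relative scaling exponent predicted by the HS model.

    Falls back to the quadratic limit p * (p - 1) / 2 when beta is
    within `tol` of 1, where the closed form becomes 0/0.
    """
    if abs(1.0 - beta) < tol:
        return p * (p - 1) / 2.0
    return ((1.0 - beta) * p - (1.0 - beta ** p)) / (1.0 - beta) ** 2

# Compare a value below one, the reference value, and a value above one.
for beta in (0.93, 1.0, 1.09):
    print(beta, [round(hs_exponent(p, beta), 3) for p in range(0, 9)])
```

A convenient check is the third-order exponent, which simplifies to $2 + \beta$; smaller $\beta$ therefore gives smaller exponents for $p > 2$.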

The HS similarity relation Eq. (5) means that the functions
$\mu_p(\ell)$ obey a generalized similarity relation over a range
of scales $\ell_1 < \ell < \ell_2$ and over a range of
intensities $p_1 < p < p_2$. Such HS similarity is an indication
of the self-organization of the ensemble of fluctuation events.
When the HS similarity is present, the parameter $\beta$ measures
the multi-scale, multi-intensity and self-organized property of
the system [23]. When $\beta \to 1$, the system is composed of
completely self-similar structures. The corresponding physical
picture is Kolmogorov turbulence, where the large- and
small-scale statistics are completely self-similar. We will
report below that if an artificial DNA sequence is completely
random, its base density fluctuations belong to this case. The
deviation of $\beta$ from one means intermittency. Generally
speaking, the further $\beta$ departs from one, the more the most
intermittent structures stand out with respect to the background
fluctuations. Therefore, the value of $\beta$ is intuitively
related to the degree of intermittency.

FIG. 4: The $\beta$-test of guanine density fluctuation for
Random, Simulation, Ecoli and Hsap4 in the range
$2^{10} < \ell < 2^{17}$ and $0 < p < 8$. A straight line
indicates the validity of the HS similarity. The slope $\beta$ is
estimated by a least-squares fit. For clarity, the second, third
and fourth sets of data points are displaced vertically upward by
a suitable amount.

FIG. 5: ESS scaling exponents $\zeta_{p,2}$ of guanine density
fluctuation for Random, Simulation, Ecoli and Hsap4. Dotted lines
are the HS model formula Eq. (7) with $\beta$ obtained from
Fig. 4. Solid lines correspond to the reference $\beta = 1$:
$\zeta_{p,2} = p(p - 1)/2$. Note that the HS model fits exactly
the four sets of scaling exponents.


IV. HS ANALYSIS FOR GENOMIC DATA 

We conduct the HS analysis for various kinds of organisms spread
over the three kingdoms of species: Eukaryotes, with Homo sapiens
(24 chromosomes) and Saccharomyces cerevisiae (16 chromosomes);
Prokaryotes, with 16 complete Archaea genomes and 124 complete
Bacteria genomes/chromosomes; and 67 complete virus genomes, all
publicly available in NCBI RefSeq Release 3, January 30, 2004.
For the H. sapiens genome, each fully sequenced chromosome is
composed of a few "contigs", each of which is a draft or finished
sequence. Therefore, we select the longest contig (typically
larger than 10 million bp) of each H. sapiens chromosome and
analyze them separately. The chromosomes of S. cerevisiae are
completely sequenced, and thus each was analyzed independently.
The 16 Archaea genomes, according to Bergey's Manual of
Systematic Bacteriology (2nd edition), belong to two phyla: 4 to
Crenarchaeota and 12 to Euryarchaeota [33]. The kingdom of
Bacteria is divided into 23 phyla by Bergey's Manual. The 124
species/strains studied here are spread over 13 phyla. Each
species/strain is assigned its 'Bergey code' for classification,
as introduced by Qi et al. [34]. For example, Escherichia coli
K12 is listed under Phylum BXII (Proteobacteria), Class III
(Gammaproteobacteria), Order XIII (Enterobacteriales), Family I
(Enterobacteriaceae), Genus XIII (Escherichia). We change all
Roman numerals to Arabic ones and write the lineage as
B.12.3.13.1.13. The 67 virus sequences studied are long enough
(beyond $2^{17}$ bp) for statistical analysis. The scales for
analyzing the nucleotide fluctuations are consistent with the
above, from $10^3$ to $10^5$ bp. For each sequence, the four
kinds of bases (adenine (A), cytosine (C), guanine (G) and
thymine (T)), as the four fundamental "words" in DNA sequences,
are analyzed independently.
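
The lineage notation used above, in which Roman rank numerals are rewritten as Arabic numbers joined by dots, can be illustrated with a small Python helper. The names `roman_to_int` and `bergey_code` are hypothetical, and the sketch only covers the small numerals that occur in such codes.

```python
ROMAN = {"I": 1, "V": 5, "X": 10, "L": 50, "C": 100}

def roman_to_int(numeral):
    """Convert a Roman numeral such as 'XIII' to an integer."""
    total = 0
    for ch, nxt in zip(numeral, numeral[1:] + " "):
        value = ROMAN[ch]
        # A smaller symbol placed before a larger one is subtracted.
        if nxt != " " and ROMAN[nxt] > value:
            total -= value
        else:
            total += value
    return total

def bergey_code(kingdom, ranks):
    """Join a kingdom letter and Roman rank numerals into a dotted code."""
    return ".".join([kingdom] + [str(roman_to_int(r)) for r in ranks])

# Phylum XII, Class III, Order XIII, Family I, Genus XIII in Bacteria.
print(bergey_code("B", ["XII", "III", "XIII", "I", "XIII"]))
```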

Most sequences pass the $\beta$-test reasonably well (with a
correlation coefficient above 0.9995) for the four different
bases, so the HS parameter $\beta$ can be obtained from the
linear fit. The measured $\beta$ of the four kinds of bases are
listed in Tables I-V for viruses, Bacteria, Archaea, S.
cerevisiae and H. sapiens, respectively. The $\beta$ obtained
from the random sequence and from the genome sequence simulated
with the minimal model are also listed in Table VI for
comparison. Other important information about these sequences,
such as names, NCBI accession numbers (Acc. No.), Bergey codes,
lengths and base compositions, is also listed in the tables.
Items are ordered according to the Bergey code, so closely
related species/strains are listed together.

FIG. 6: The log-log plot of $\beta$ vs. $V_p/V_n$ for (a)
Bacteria, (b) Archaea, (c) Human chromosomes and (d) viruses.
$V_p$ and $V_n$ are calculated in a window of 1024 bp. Note that
$\beta$ roughly increases with increasing $V_p/V_n$. Symbols:
(•) A, (○) C; triangles denote T and G. Dashed lines mark the
value 1.



A. The meaning of parameter $\beta$

As an example, the results of the $\beta$-test for Random,
Simulation, Ecoli and Hsap4 are shown in Fig. 4, where the scale
range is between $10^3$ bp and $10^5$ bp and $\ell_0 = 1024$. The
good linearity of the plots for all four cases indicates that the
HS similarity is satisfied, which means that all the sequences,
including the random one, have a well-organized HS scaling
property. The values of $\beta_G$ obtained are 0.99 ± 0.000,
0.98 ± 0.000, 0.93 ± 0.001 and 1.09 ± 0.002 for Random,
Simulation, Ecoli and Hsap4, respectively. The ESS relative
scaling exponents $\zeta_{p,2}$ measured in Fig. 3 are plotted as
a function of the order $p$ in Fig. 5, which also presents the
prediction of the HS model, Eq. (7), with the parameters $\beta$
obtained in Fig. 4. Good agreement between the measured scaling
exponents (points) and the HS model predictions (lines) is
established. The result for the random sequence shows that its
scaling exponents have the theoretical quadratic form
$\zeta_{p,2} = p(p - 1)/2$ with $\beta \to 1$. Furthermore, the
scaling exponents $\zeta_{p,2}$ are distinctly separated into
three groups: Random and Simulation with systematically moderate
$\zeta_{p,2}$; Hsap4 with larger ones; and Ecoli with smaller
ones. Theoretically speaking, smaller ESS scaling exponents mean
a less heterogeneous and more highly intermittent structure,
corresponding to a smaller $\beta$. The results of Fig. 5 are
consistent with this theoretical consideration.

The quantitative $\beta$ values differ among these four
sequences. The i.i.d. random sequence has a $\beta$ very close to
1, which is consistent with the self-similar picture of both its
base density fluctuations and its Gaussian-like PDFs. The
simulated sequence, with $\beta$ close to 1, has base density
fluctuations that are very nearly random. Interestingly, Ecoli
and Hsap4 have $\beta$ values that deviate markedly from one:
$\beta_G$ of Ecoli is lower than one, while that of Hsap4 is
higher. This contrast can also be seen in the opposite
fluctuations of the two sequences in Fig. 1. The deviation of
$\beta$ from one is consistent with the increasingly intermittent
structures present in the guanine density fluctuations shown in
Fig. 1 and Fig. 2. The $\beta$ values measured here give a good
quantitative description of these intermittent structures.

Most $\beta$ values for the various genomes in Tables I-V deviate
significantly from one, indicating a non-Gaussian statistical
property of the base density fluctuations. Mathematically, the
existence of $\beta$ in a range $\ell_1 < \ell < \ell_2$ means
that the incremental hierarchical intensities $\mu_p$
($p = 0, 1, \ldots$) in this scaling range are linked by a
similarity parameter $\beta$. $\mu_p$ is the mathematical
expectation of the $p$th-order weighted PDF, which is directly
related to the shape of the original (0th-order) PDF. If the
original PDF is skewed or peaked, the rate of increase of $\mu_p$
will deviate from that of a Gaussian PDF, which leads to the
deviation of $\beta$ from one. Thus $\beta$ is closely related to
the heterogeneity (caused by atypical base density) of the
sequence. The measured $\beta$, which depends on the individual
organism, reflects the different fluctuation structures of the
different genomes.

To illustrate this point, we study the atypical components of the
fluctuation signals. When the atypical components have a biased
distribution, e.g., with more positive components than negative
ones, the PDF is skewed with a long right tail. $V_p$ is
introduced to measure the percentage of the positive components
beyond a threshold relative to the whole ensemble:

$$V_p = \int_{\mu + H}^{\infty} P(f)\, df, \tag{8}$$

where $f$ is the base composition measured in a fixed window and
$\mu$ is the mean value of $f$. Similarly, the percentage of the
negative components $V_n$ is defined as

$$V_n = \int_{-\infty}^{\mu - H} P(f)\, df. \tag{9}$$

When calculating $V_p$ and $V_n$, we fix the window length at
1024 bp and let $H$ be 1.5 times the standard deviation of $f$.
For a reasonable PDF (with a single maximum), the skewness can be
roughly characterized by the relative values of $V_p$ and $V_n$.
When $V_p$ is much larger than $V_n$, the PDF tends to have a
long right tail (as in Fig. 2(d)), and vice versa. We study the
relationship between $\beta$ and the biased fluctuation of the
local base density through a log-log plot of $\beta$ vs.
$V_p/V_n$ in Fig. 6 for Bacteria, Archaea, human chromosomes and
viruses, respectively. Note that $\beta$ and $V_p/V_n$ follow the
same trend, which relates $\beta$ to the biased distribution of
atypical components. In short, $\beta$ measures the heterogeneity
of a genome sequence. Although a theory called the mutational
equilibrium theory [35] has been proposed to interpret the
stationarity of G+C content within a species, the understanding
(both qualitative and quantitative) of base compositional
heterogeneity is still limited. Hereafter we propose $\beta$ as a
"probe" for systematically studying genomic heterogeneity.
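
The two tail measures can be estimated directly from a list of local densities. The Python sketch below assumes that each measure is the fraction of windows lying more than 1.5 standard deviations above or below the mean, which is one reading of "percentage of the components beyond a threshold". The function name `tail_fractions` and the parameter `k` are hypothetical.

```python
import math

def tail_fractions(densities, k=1.5):
    """Fractions of windows above mean + k*std and below mean - k*std.

    Returns (v_p, v_n). Their ratio gives a rough, outlier-focused
    indication of the skewness of the density distribution.
    """
    n = len(densities)
    mean = sum(densities) / n
    # Population standard deviation of the local densities.
    std = math.sqrt(sum((f - mean) ** 2 for f in densities) / n)
    h = k * std
    v_p = sum(1 for f in densities if f > mean + h) / n
    v_n = sum(1 for f in densities if f < mean - h) / n
    return v_p, v_n
```

When one of the two fractions is zero the ratio is undefined or infinite, so a real analysis needs enough windows for both tails to be populated.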

The analysis above is focused on the meaning of the parameter
$\beta$ in terms of fluctuations of base composition, which is an
intuitive study of the physical picture of genome sequences. Such
a physical study may have other biological implications. It
should be emphasized that the HS theory has a mathematical
invariance (symmetry), obtained by defining a transformation
group [23]. Such a symmetry is exactly achieved through a
log-Poisson cascade process [36]. When we carry out multiscaling
and hierarchical analysis of DNA, RNA and protein sequences, some
transformations or symmetries may be useful for clarifying the
complexity of a biological system, especially for genome data,
which combine information on structure and function. We think
that these quantitative properties, especially the HS parameter
$\beta$, are useful for characterizing biological questions, and
further research should be done. In the following, we expound two
points: DNA strand symmetry in Subsec. IV B and sequence
complexity with $\beta$ in Subsec. IV C.



B. Strand symmetry 

One of the most intriguing results obtained here is 
that base composition fluctuations of most Prokaryotic genomes and Eukaryotic chromosomes
obey a parity rule: $\beta_A \approx \beta_T$ and $\beta_C \approx \beta_G$. This extends
Chargaff's parity rule 2 (PR2), which states that if the single strands of a long DNA duplex
(say, a few thousand bp) are isolated and their base compositions determined, then
%A $\approx$ %T and %C $\approx$ %G H3-. The validity of PR2 became clearer once full genome
sequences were analyzed. PR2 can be generalized to the compositions of dinucleotides and other
oligonucleotides [38], or even to the whole base-base correlation function [13, 39]. PR2 is
generally interpreted as a strand symmetry of biological factors, such as mutation and/or
selection pressures, acting on single bases or oligonucleotides. Local asymmetry in base
composition has also been widely reported [40]. As illustrated in Fig. 7 (Top), the local base
densities of A(G) and T(C) are not equal, but enantiomorphous to each other. However, we find
that the $\beta$ values of each pair are very close. The parity rule in terms of fluctuation
discovered here means that, when full genomic sequences are considered, the fluctuation
structures of A(C) and T(G) are approximately identical, although the local fluctuations at the
same position may differ. This point is well illustrated in Fig. 7 (Bottom), in spite of the
remarkable out-of-phase fluctuations.

FIG. 7: (Top) Local density fluctuations with a scaling window of $\ell = 10^4$ bp of four
bases (a) A, (b) T, (c) G and (d) C for B-Xfa. The sliding window moves at a step of length
$\Delta = 10^3$ bp. Note the approximately enantiomorphous fluctuation between A and T, G and C,
and the complementary property of base fluctuation between A and G, T and C. (Bottom)
$\beta$-test of base density fluctuation for B-Xfa in the range $2^{10} < \ell < 2^{17}$ and
$0 < p < 8$. A straight line indicates the validity of the HS similarity. Note that the parity
rule $\beta_A \approx \beta_T$ and $\beta_C \approx \beta_G$ is obeyed. For clarity, the second,
third and fourth sets of data points are displaced vertically upward by a suitable amount.

FIG. 8: The symmetry levels $S^1$ measured on the base $\beta$ values of genomes or chromosomes
increase with increasing sequence length.

The existence of PR2 in terms of the HS parameter $\beta$ means that the global action of
evolutionary factors, such as mutation pressure or selection pressure, is not biased on a
genomic scale. This finding is consistent with our previous results [l^, and its biological
implications deserve further study. Recently ref. jllj introduced a similarity function to
describe the strand symmetry, which has the form:

$$ S^1 = 1 - \frac{|f_A - f_T| + |f_C - f_G|}{|f_A + f_T| + |f_C + f_G|}, \tag{10} $$

where $f_i$ with $i \in \{A, T, C, G\}$ denotes the density of a single nucleotide. $S^1$
characterizes the symmetry level, ranging from 0 (asymmetry/dissimilarity) to 1 (perfect
symmetry/similarity) (details can be found in the reference cited above). We calculate $S^1$
on the $\beta$ values of the four bases and display the results in Fig. 8, where the symmetry
level roughly increases with increasing sequence length.
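
As a minimal sketch of how the similarity function above can be evaluated, the following Python snippet computes $S^1$ both on base densities and on $\beta$ values. The function names `strand_symmetry` and `base_densities` are hypothetical helpers introduced here for illustration; the numerical inputs are the V-AsFV entries of Table I, read in the column order A, C, G, T.

```python
def strand_symmetry(x_a, x_t, x_c, x_g):
    """Similarity function S^1 = 1 - (|A-T| + |C-G|) / (|A+T| + |C+G|).

    The arguments may be base densities f_i or HS parameters beta_i.
    """
    numerator = abs(x_a - x_t) + abs(x_c - x_g)
    denominator = abs(x_a + x_t) + abs(x_c + x_g)
    if denominator == 0:
        raise ValueError("all inputs are zero; S^1 is undefined")
    return 1.0 - numerator / denominator


def base_densities(sequence):
    """Single-strand densities of A, T, C and G (other symbols are ignored)."""
    seq = sequence.upper()
    counts = {base: seq.count(base) for base in "ATCG"}
    total = sum(counts.values())
    if total == 0:
        raise ValueError("sequence contains no A, T, C or G")
    return {base: counts[base] / total for base in "ATCG"}


# V-AsFV row of Table I: A% = 0.304, C% = 0.194, G% = 0.195, T% = 0.307
s1_base = strand_symmetry(0.304, 0.307, 0.194, 0.195)
# Same row: beta_A = 0.926, beta_C = 0.890, beta_G = 0.909, beta_T = 0.964
s1_beta = strand_symmetry(0.926, 0.964, 0.890, 0.909)
print(round(s1_base, 3), round(s1_beta, 3))  # 0.996 0.985
```

Both printed values agree with the two symmetry columns listed for V-AsFV in Table I, which suggests that the same formula is applied to densities and to $\beta$ values.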

C. Sequence complexity 

Another intriguing result is the systematic change of $\beta$ with evolutionary categories.
The relationship between evolutionary categories and sequence correlation structures has been
studied previously in 0, |2^, E3|. Note that Buldyrev et al. [11] suggested that the complexity
of noncoding DNA sequences increased with evolution, with spectrum exponents increasing for
highly evolved species, while Voss [13] found that the spectrum exponents decrease with
evolution. These incompatible findings are due to the equivocal meaning of spectrum exponents.
We have shown that the HS parameter $\beta$ carries an implication for biological evolution
[17]: the decrease of the category-averaged $\beta_A$ reflects the increasing degree of
organization in more developed species. As shown in Sec. IV A, we related the decrease of
$\beta$ (of a specific base) to the increasing sequence heterogeneity introduced by the
concentration of low-density base compositions.

FIG. 9: The mean values of $(\beta_C + \beta_G)/2$ versus $(\beta_A + \beta_T)/2$ for 124
bacterial genomes/chromosomes, 16 archaeal genomes, Yeast, Human and Viruses. The solid points
are the mean values. Note the cluster property of the three kingdoms and the diversity of
viruses.

We introduce a new definition of sequence complexity as the total heterogeneity of the four
different bases. Because of strand symmetry, we reduce the number of variables from four to two
by setting $\beta_s = (\beta_C + \beta_G)/2$ and $\beta_w = (\beta_A + \beta_T)/2$. The
quantitative expression for sequence complexity is then written as:

$$ SC = |\beta_s - \beta_w|. \tag{11} $$

For a random sequence, $SC = 0$ because in that case $\beta_A \approx \beta_G$, which together
with the parity rule gives $\beta_s \approx \beta_w$. This establishes a zero complexity for
random sequences. The complexity of a simulated sequence generated by the minimal model [20] is
nearly zero. The SC values for different categories, shown in Fig. 9, indicate that Human has
the highest complexity and that Eukaryotes have a higher complexity than Prokaryotes.
Interestingly, on average Archaea are more complex than Bacteria. It is not clear whether this
is because the number of bacterial genomes studied is sufficient to give a statistical average,
while that of archaeal genomes is biased by its relatively small number of members. But it is
remarkable that in this finite set both $\beta_s$ and $\beta_w$ of archaeal genomes are lower
than those of most bacterial genomes. This is an interesting phenomenon which needs further
study. Viruses are very diverse: their $\beta$ values have a large variance. This may be
related to their variable nature.
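
The definition of sequence complexity above can be sketched in a few lines of Python. The helper name `sequence_complexity` is an assumption made for illustration; the example values are the four $\beta$ entries listed for B-SspW (Synechococcus sp. WH8102) in Table II, in the column order A, C, G, T.

```python
def sequence_complexity(beta_a, beta_c, beta_g, beta_t):
    """Return (SC, beta_s, beta_w) with SC = |beta_s - beta_w|.

    beta_s averages the C and G parameters, beta_w the A and T parameters.
    """
    beta_s = (beta_c + beta_g) / 2.0
    beta_w = (beta_a + beta_t) / 2.0
    return abs(beta_s - beta_w), beta_s, beta_w


# B-SspW row of Table II: beta_A, beta_C, beta_G, beta_T
sc, beta_s, beta_w = sequence_complexity(1.077, 0.843, 0.822, 1.085)
print(f"beta_s={beta_s:.4f}  beta_w={beta_w:.4f}  SC={sc:.4f}")

# Idealized random sequence: all four parameters equal, so SC vanishes.
sc_random, _, _ = sequence_complexity(1.0, 1.0, 1.0, 1.0)
print(sc_random)  # 0.0
```

For this bacterial genome $\beta_s < \beta_w$, the ordering the text attributes to Archaea and Bacteria.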

The relative magnitude of $\beta_s$ and $\beta_w$ also needs further investigation. From
Fig. 9 it is clear that for Human $\beta_s > \beta_w$, while the opposite holds for Archaea and
Bacteria. This indicates the presence of many low-concentration regions of C or G (and hence a
high concentration of A or T) in prokaryotic genomes, whereas many regions with a high
concentration of C or G are spread along the Human genome.

One plausible origin of sequence complexity is horizontal gene transfer (HGT), which has been
recognized as one of the major forces in prokaryotic genome evolution 0|. HGT increases
heterogeneity by incorporating alien sequences: recently transferred sequences from distantly
related species have not yet undergone sufficient mutational pressure, so their atypical base
composition can be distinguished from that of the ancestral DNA [Ij, 45]. According to this
criterion, Garcia-Vallve et al. [45] found that 0% to 22.2% of the total genes of 88 bacterial
and archaeal genomes were obtained by horizontal gene transfer. The guanine density
fluctuations and $\beta$-tests of three typical cyanobacteria, Thermosynechococcus elongatus
BP-1 (BP-1), Synechocystis PCC6803 (PCC6803) and Synechococcus sp. WH8102 (WH8102), are shown
in Fig. 10. WH8102 has a notably small $\beta$ and extensive low-guanine regions compared with
the other two. Indeed, many low-G+C segments of the WH8102 genome have been comprehensively
identified as obtained by HGT [46], contributing to functions such as envelope modification of
the cell surface and swimming motility.
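
The local guanine density profiles discussed above can be approximated with a simple sliding-window count. The sketch below is an assumption-laden illustration, not the authors' implementation: it takes the density to be the fraction of a given base inside each window, and uses the window length $10^4$ bp and step $10^3$ bp quoted in the figure captions as defaults. The function name `local_density` is hypothetical.

```python
def local_density(sequence, base, window=10_000, step=1_000):
    """Fraction of `base` in each sliding window of length `window`.

    Windows start at 0, step, 2*step, ... and must fit entirely in the sequence.
    """
    if window <= 0 or step <= 0:
        raise ValueError("window and step must be positive")
    seq = sequence.upper()
    target = base.upper()

    # Prefix sums of matches give each window count in O(1).
    prefix = [0]
    for symbol in seq:
        prefix.append(prefix[-1] + (1 if symbol == target else 0))

    densities = []
    for start in range(0, len(seq) - window + 1, step):
        count = prefix[start + window] - prefix[start]
        densities.append(count / window)
    return densities


# Toy example with a short window; real genomes would use the defaults.
profile = local_density("GGGGAAAAGGAA", "G", window=4, step=4)
print(profile)  # [1.0, 0.0, 0.5]
```

Windows with unusually low values in such a profile correspond to the low-guanine regions mentioned in the text; whether they stem from HGT cannot be decided from the density alone.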



V. CONCLUSIONS 

Our approach by HS analysis has several merits: first, it has a solid theoretical foundation
and substantial experimental support, and various models deduced from the HS theory have been
widely used to analyze nonlinear fields containing intermittent structures; second, it employs
an extended scaling analysis, which leads to a more accurate identification of scaling
properties; third, but not least, the multiple-scale fluctuation analysis of base composition
is adequate to detect large-scale (up to $10^5$ bp) correlations.

By carrying out a systematic study of the large-scale structures of available genomes, we show
that the large-scale base density fluctuations ($10^3$ to $10^5$ bp) of most examined sequences
(including genomes of Archaea, Bacteria, Eukaryotes and viruses) satisfy the HS similarity
relation. This reveals that base density fluctuations of genomes are hierarchically organized
across scales and across intensities. The HS analysis ($\beta$-test) allows one to quantify the
degree of multi-scale and multi-intensity correlations. It is known that the major contributing
factor to the sequence-wide pattern is not the mean base content or the correlations among
neighboring bases in the genome sequence, but the spatial heterogeneity of the base composition
fluctuation, or the long-range correlation, which largely shapes the complexity of the whole
sequence. The HS parameter $\beta$ captures this point. Furthermore, $\beta$ is effective in
describing horizontal gene transfer, strand symmetry and phylogenetic relations of various
biological species.
FIG. 10: (Top) Local density fluctuations with a scaling window of $\ell = 10^4$ bp of G for
(a) BP-1, (b) PCC6803 and (c) WH8102. The sliding window moves at a step of length
$\Delta = 10^3$ bp. (Bottom) $\beta$-test for BP-1, PCC6803 and WH8102. The test range is
$2^{10} < \ell < 2^{17}$ and $0 < p < 8$. A straight line indicates the validity of the HS
similarity. Note that the smaller value of $\beta$ for WH8102 corresponds to many
low-concentration regions of G.



The values of $\beta$ for natural DNA sequences deviate distinctly from one, the value for a
completely random sequence. The $\beta$ values diverge significantly but preserve the parity
among bases, presumably as a consequence of strand symmetry. The HS parameters $\beta$ are
clustered according to evolutionary categories. This indicates that the spatial heterogeneity
of base composition, or long-range correlation, in genomic sequences differs among Archaea,
Bacteria and Eukaryotes. The heterogeneity is interpreted as reflecting different patterns of
genetic material transfer in evolutionary communities.
APPENDIX A: ABBREVIATIONS

The abbreviations of the sequence names are given in parentheses, in alphabetical order:

(a) Table I Virus (most abbreviations are taken from http://www.ncbi.nlm.nih.gov/ICTVdb):
African swine fever virus (V-AsFV), Agrotis segetum granulovirus (V-AsGV), Amsacta moorei
entomopoxvirus (V-AmEPV), Autographa californica nucleopolyhedrovirus (V-AcNPV), Bovine
herpesvirus 1 (V-BoHV-1), Bovine herpesvirus 5 (V-BoHV-5), Bovine papular stomatitis virus
(V-BPSV), Callitrichine herpesvirus 3 (V-CalHV-3), Camelpox virus (V-CMLV), Canarypox virus
(V-CNPV), Cercopithecine herpesvirus 1 (V-CeHV-1), Chimpanzee cytomegalovirus (V-CCMV),
Choristoneura fumiferana defective nucleopolyhedrovirus (V-CfDEFNPV), Cowpox virus (V-CPXV),
Ectocarpus siliculosus virus (V-EsV), Ectromelia virus (V-ECTV), Equine herpesvirus 1
(V-EHV-1), Equine herpesvirus 2 (V-EHV-2), Equine herpesvirus 4 (V-EHV-4), Fowlpox virus
(V-FWPV), Gallid herpesvirus 2 (V-GaHV-2), Gallid herpesvirus 3 (V-GaHV-3), Goatpox virus
(V-GTPV), Helicoverpa armigera nucleopolyhedrovirus G4 (V-HaNPV), Heliothis zea virus 1
(V-HzV-1), Human herpesvirus 1 (V-HHV-1), Human herpesvirus 2 (V-HHV-2), Human herpesvirus 4
(V-HHV-4), Human herpesvirus 5 (V-HHV-5), Human herpesvirus 6 (V-HHV-6), Human herpesvirus 6B
(V-HHV-6B), Human herpesvirus 7 (V-HHV-7), Human herpesvirus 8 (V-HHV-8), Ictalurid
herpesvirus 1 (V-IcHV-1), Invertebrate iridescent virus 6 (V-IIV-6), Lumpy skin disease virus
(V-LSDV), Lymantria dispar nucleopolyhedrovirus (V-LdNPV), Lymphocystis disease virus - isolate
China (V-LCDV), Macaca mulatta rhadinovirus (V-MMRV), Mamestra configurata
nucleopolyhedrovirus A (V-MacoNPV-A), Mamestra configurata nucleopolyhedrovirus B
(V-MacoNPV-B), Melanoplus sanguinipes entomopoxvirus (V-MsEPV), Meleagrid herpesvirus 1
(V-MeHV-1), Molluscum contagiosum virus (V-MCV), Monkeypox virus (V-MPXV), Mouse
cytomegalovirus 1 (V-MCMV-1), Myxoma virus (V-MYXV), Orf virus (V-ORFV), Orgyia pseudotsugata
multicapsid nucleopolyhedrovirus (V-OpMNPV), Ostreid herpesvirus 1 (V-OsHV-1), Paramecium
bursaria Chlorella virus 1 (V-PBCV-1), Psittacid herpesvirus 1 (V-PsHV-1), Rabbit fibroma
virus (V-SFV), Rabbitpox virus (V-RPXV), Rachiplusia ou multiple nucleopolyhedrovirus
(V-RoMNPV), Rat cytomegalovirus (V-RCMV), Sheeppox virus (V-SPPV), Shrimp white spot syndrome
virus (V-SWSSV), Spodoptera exigua nucleopolyhedrovirus (V-SpeiNPV), Spodoptera litura
nucleopolyhedrovirus (V-SpltNPV), Swinepox virus (V-SWPV), Tupaia herpesvirus (V-TuHV),
Vaccinia virus (V-VACV), Variola virus (V-VARV), Xestia c-nigrum granulovirus (V-XecnGV), Yaba
monkey tumor virus (V-YMTV), Yaba-like disease virus (V-YLDV).

(b) Table II Bacteria: Agrobacterium tumefaciens strain C58 C & L (B-Atu1 & B-Atu2), Aquifex
aeolicus (B-Aae), Bacillus anthracis A2012 (B-Ban), Bacillus halodurans (B-Bha), Bacillus
subtilis (B-Bsu), Bacteroides thetaiotaomicron VPI-5482 (B-Bth), Bifidobacterium longum NCC2705
(B-Blo), Bordetella bronchiseptica (B-Bbr), Bordetella parapertussis (B-Bpa), Bordetella
pertussis (B-Bpe), Borrelia burgdorferi (B-Bbu), Bradyrhizobium japonicum (B-Bja), Brucella
melitensis chromosome I & II (B-Bme1 & B-Bme2), Brucella suis chromosome I & II (B-Bsu1 &
B-Bsu2), Buchnera aphidicola (B-Bap), Buchnera aphidicola Sg (B-BapS), Buchnera sp. APS
(B-Bsp), Campylobacter jejuni (B-Cje), Caulobacter crescentus (B-Ccr), Chlamydia muridarum
(B-Cmu), Chlamydia trachomatis (B-Ctr), Chlamydophila caviae GPIC (B-Cca), Chlamydophila
pneumoniae AR39, CWL029, J138 & TW-183 (B-Cpn1, B-Cpn2, B-Cpn3 & B-Cpn4), Chlorobium tepidum
TLS (B-Cte), Chromobacterium violaceum ATCC 12472 (B-Cvi), Clostridium acetobutylicum ATCC824
(B-Cac), Clostridium perfringens (B-Cpe), Clostridium tetani E88 (B-Cte), Corynebacterium
efficiens YS-314 (B-Cef), Corynebacterium glutamicum (B-Cgl), Coxiella burnetii (B-Cbu),
Deinococcus radiodurans chromosome 1 & 2 (B-Dra1 & B-Dra2), Escherichia coli CFT073, K12,
O157:H7 & O157:H7 EDL933 (B-Eco1, B-Eco2, B-Eco3 & B-Eco4), Fusobacterium nucleatum ATCC 25586
(B-Fnu), Haemophilus ducreyi 35000HP (B-Hdu), Haemophilus influenzae Rd (B-Hin), Helicobacter
hepaticus (B-Hhe), Helicobacter pylori 26695 & J99 (B-Hpy1 & B-Hpy2), Lactobacillus plantarum
(B-Lpl), Lactococcus lactis sp. IL1403 (B-Lla), Leptospira interrogans I & II (B-Lin1 &
B-Lin2), Listeria innocua (B-Lin), Listeria monocytogenes EGD-e (B-Lmo), Mesorhizobium loti
(B-Mlo), Mycobacterium bovis subsp. bovis AF2122/97 (B-Mbo), Mycobacterium leprae TN (B-Mle),
Mycobacterium tuberculosis CDC1551 & H37Rv (B-Mtu1 & B-Mtu2), Mycoplasma gallisepticum
(B-Mga), Mycoplasma genitalium (B-Mge), Mycoplasma penetrans (B-Mpe), Mycoplasma pneumoniae
(B-Mpn), Mycoplasma pulmonis UAB CTIP (B-Mpu), Neisseria meningitidis MC58 & Z2491 (B-Nme1 &
B-Nme2), Nostoc sp. PCC7120 (B-Nsp), Oceanobacillus iheyensis (B-Oih), Pasteurella multocida
PM70 (B-Pmu), Pirellula sp. (B-Psp), Porphyromonas gingivalis W83 (B-Pgi), Prochlorococcus
marinus CCMP1375, CCMP1378 & MIT9313 (B-Pma1, B-Pma2 & B-Pma3), Pseudomonas aeruginosa PAO1
(B-Pae), Pseudomonas putida KT2440 (B-Ppu), Pseudomonas syringae (B-Psy), Rickettsia conorii
(B-Rco), Rickettsia prowazekii (B-Rpr), Ralstonia solanacearum chromosome (B-Rso), Salmonella
typhi (B-Sty1), Salmonella typhimurium LT2 (B-Sty2), Salmonella typhi Ty2 (B-Sty3),
Sinorhizobium meliloti 1021, pSymA & pSymB (B-Sme1, B-Sme2 & B-Sme3), Shewanella oneidensis
MR-1 (B-Son), Shigella flexneri 2a strain 301 (B-Sfl), Staphylococcus aureus Mu50, MW2 & N315
(B-Sau1, B-Sau2 & B-Sau3), Staphylococcus epidermidis ATCC 12228 (B-Sep), Streptococcus
agalactiae 2603 V/R & NEM316 (B-Sag1 & B-Sag2), Streptococcus mutans UA159 (B-Smu),
Streptococcus pneumoniae R6 & TIGR4 (B-Spn1 & B-Spn2), Streptococcus pyogenes MGAS8232,
MGAS315, SF370 & SSI-1 (B-Spy1, B-Spy2, B-Spy3 & B-Spy4), Streptomyces coelicolor A3(2)
(B-Sco), Streptomyces avermitilis MA-4680 (B-Sav), Synechococcus sp. WH8102 (B-SspW),
Synechocystis sp. PCC6803 (B-SspP), Treponema pallidum (B-Tpa), Thermoanaerobacter
tengcongensis (B-Tte), Thermosynechococcus elongatus BP-1 (B-Tel), Thermotoga maritima
(B-Tma), Ureaplasma urealyticum (B-Uur), Vibrio cholerae chromosome 1 & 2 (B-Vch1 & B-Vch2),
Vibrio parahaemolyticus RIMD 2210633 chromosome 1 & 2 (B-Vpa1 & B-Vpa2), Vibrio vulnificus
CMCP6 chromosome I & II (B-Vvu1 & B-Vvu2), Wigglesworthia brevipalpis (B-Wbr), Wolinella
succinogenes (B-Wsu), Xanthomonas axonopodis citri 306 (B-Xax), Xanthomonas campestris ATCC
33913 (B-Xca), Xylella fastidiosa (B-Xfa), Yersinia pestis strain CO92 & KIM (B-Ype1 &
B-Ype2).

(c) Table III Archaea: Aeropyrum pernix (A-Ape), Archaeoglobus fulgidus (A-Afu), Halobacterium
sp. NRC-1 (A-Hsp), Methanobacterium thermoautotrophicum (A-Mth), Methanococcus jannaschii
(A-Mja), Methanopyrus kandleri AV19 (A-Mka), Methanosarcina acetivorans (A-Mac), Methanosarcina
mazei Goe1 (A-Mma), Pyrococcus abyssi (A-Pab), Pyrococcus furiosus (A-Pfu), Pyrococcus
horikoshii (A-Pho), Sulfolobus solfataricus (A-Sso), Sulfolobus tokodaii (A-Sto), Thermoplasma
acidophilum (A-Tac), Thermoplasma volcanium (A-Tvo).
TABLE I: Information of virus complete sequences.

Virus         Length   A%     C%     G%     T%     $\beta_A$  $\beta_C$  $\beta_G$  $\beta_T$  $S^1_{base}$  $S^1_{beta}$  Acc. No.
V-AsFV        170101   0.304  0.194  0.195  0.307  0.926      0.890      0.909      0.964      0.996         0.985         NC_001659
V-AsGV        131680   0.309  0.182  0.191  0.318  0.947      0.938      1.058      1.069      0.982         0.940         NC_005839
V-AmEPV       232392   0.405  0.090  0.088  0.417  0.962      0.970      0.966      0.928      0.986         0.990         NC_002520
V-AcNPV       133894   0.293  0.203  0.204  0.300  0.984      0.952      1.019      0.949      0.992         0.974         NC_001623
V-BoHV-1      135301   0.135  0.359  0.365  0.140  0.899      1.093      1.043      0.925      0.989         0.981         NC_001847
V-BoHV-5      138390   0.124  0.372  0.376  0.128  1.025      1.047      0.964      1.028      0.992         0.979         NC_005261
V-BPSV        134431   0.178  0.322  0.323  0.177  1.029      0.933      0.895      1.118      0.998         0.968         NC_005337
V-CalHV-3     149696   0.262  0.247  0.245  0.245  0.805      1.044      1.025      0.838      0.981         0.986         NC_004367
V-CMLV        205719   0.336  0.166  0.166  0.332  0.941      1.009      0.987      0.962      0.996         0.989         NC_003391
V-CNPV        359853   0.352  0.152  0.152  0.344  0.967      0.984      1.044      0.918      0.992         0.972         NC_005309
V-CeHV-1      156789   0.128  0.369  0.375  0.127  1.074      1.048      1.043      1.055      0.993         0.994         NC_004812
V-CCMV        241087   0.192  0.307  0.310  0.191  0.996      0.886      0.869      0.971      0.996         0.989         NC_003521
V-CfDEFNPV    131158   0.270  0.230  0.229  0.271  1.064      0.964      0.968      0.990      0.998         0.980         NC_005137
V-CPXV        224501   0.333  0.168  0.166  0.333  0.964      1.025      1.000      0.978      0.998         0.990         NC_003663
V-EsV         335593   0.244  0.258  0.260  0.238  0.991      0.999      1.170      0.950      0.992         0.948         NC_002687
V-ECTV        209771   0.335  0.167  0.165  0.334  0.952      1.022      0.977      0.969      0.997         0.984         NC_004105
V-EHV-1       150223   0.217  0.287  0.279  0.216  0.823      1.074      1.094      0.808      0.991         0.991         NC_001491
V-EHV-2       184427   0.216  0.293  0.282  0.209  0.941      0.907      0.875      0.975      0.982         0.982         NC_001650
V-EHV-4       145597   0.249  0.254  0.251  0.247  0.849      1.080      1.090      0.842      0.995         0.996         NC_001844
V-FWPV        288539   0.348  0.154  0.154  0.343  0.928      0.984      1.044      0.959      0.995         0.977         NC_002188
V-GaHV-2      138675   0.283  0.215  0.214  0.287  0.846      1.164      1.165      0.869      0.995         0.994         NC_002229
V-GaHV-3      164270   0.230  0.269  0.267  0.234  0.797      1.086      1.090      0.771      0.994         0.992         NC_002577
V-GTPV        149599   0.380  0.124  0.129  0.367  0.955      0.909      0.941      0.957      0.982         0.991         NC_004003
V-HaNPV       131403   0.301  0.194  0.196  0.309  0.946      0.961      0.978      0.988      0.990         0.985         NC_002654
V-HzV-1       228089   0.288  0.211  0.208  0.293  0.951      0.922      0.905      0.934      0.992         0.991         NC_004156
V-HHV-1       152261   0.159  0.338  0.345  0.158  0.833      1.203      1.181      0.820      0.992         0.991         NC_001806
V-HHV-2       154746   0.149  0.350  0.353  0.147  0.873      1.109      1.076      0.832      0.995         0.981         NC_001798
V-HHV-4       172281   0.198  0.305  0.295  0.203  0.877      0.949      1.117      0.958      0.985         0.936         NC_001345
V-HHV-5       230287   0.216  0.283  0.289  0.212  0.981      0.932      0.889      0.976      0.990         0.987         NC_001347
V-HHV-6       159321   0.289  0.217  0.208  0.287  0.876      0.946      0.924      0.868      0.989         0.992         NC_001664
V-HHV-6B      162114   0.287  0.217  0.211  0.286  0.859      0.955      0.965      0.828      0.993         0.989         NC_000898
V-HHV-7       144861   0.324  0.181  0.172  0.322  0.912      1.001      1.091      0.824      0.989         0.954         NC_001716
V-HHV-8       137508   0.237  0.275  0.260  0.228  1.091      1.021      1.046      1.011      0.976         0.975         NC_003409
V-IcHV-1      134226   0.214  0.281  0.281  0.224  0.887      0.940      1.001      1.075      0.990         0.936         NC_001493
V-IIV-6       212482   0.352  0.148  0.139  0.362  0.960      1.000      0.978      0.933      0.981         0.987         NC_003038
V-LSDV        150773   0.376  0.127  0.132  0.364  0.957      0.915      0.941      0.957      0.983         0.993         NC_003027
V-LdNPV       161046   0.213  0.287  0.288  0.213  0.969      0.919      0.954      0.964      0.999         0.989         NC_001973
V-LCDV        186250   0.363  0.135  0.138  0.365  0.951      1.023      1.136      0.986      0.995         0.964         NC_005902
V-MMRV        133719   0.245  0.267  0.258  0.230  0.896      0.982      1.177      0.971      0.976         0.933         NC_003401
V-MacoNPV-A   155060   0.292  0.207  0.209  0.291  0.958      0.965      0.984      1.015      0.997         0.981         NC_003529
V-MacoNPV-B   158482   0.302  0.199  0.202  0.298  0.978      0.960      0.985      0.988      0.993         0.991         NC_004117
V-MsEPV       236120   0.407  0.092  0.091  0.410  0.980      0.956      0.947      0.945      0.996         0.989         NC_001993
V-MeHV-1      159160   0.260  0.238  0.238  0.265  0.884      1.052      1.049      0.878      0.995         0.998         NC_002641
V-MCV         190289   0.184  0.315  0.318  0.182  1.021      0.929      0.930      0.985      0.995         0.990         NC_001731
V-MPXV        196858   0.335  0.166  0.165  0.334  0.942      1.001      0.996      0.956      0.998         0.995         NC_003310
V-MCMV-1      230278   0.204  0.292  0.295  0.209  1.024      0.939      0.883      0.981      0.992         0.974         NC_004065
V-MYXV        161773   0.287  0.217  0.219  0.278  0.941      1.026      1.018      0.950      0.989         0.996         NC_001132
V-ORFV        139962   0.184  0.318  0.316  0.181  1.021      0.877      0.812      1.069      0.995         0.970         NC_005336
V-OpMNPV      131995   0.223  0.276  0.275  0.225  0.958      0.907      0.976      1.238      0.997         0.914         NC_001875
V-OsHV-1      207439   0.314  0.192  0.195  0.298  0.889      1.033      1.017      0.909      0.981         0.991         NC_005881
V-PBCV-1      330743   0.300  0.201  0.198  0.300  0.967      1.063      1.062      0.931      0.997         0.991         NC_000852
V-PsHV-1      163025   0.193  0.308  0.301  0.198  0.898      1.060      1.070      0.937      0.988         0.988         NC_005264
V-SFV         159857   0.306  0.196  0.199  0.298  0.965      1.035      1.021      0.953      0.989         0.993         NC_001266
TABLE I: (Continued)

Virus         Length   A%     C%     G%     T%     $\beta_A$  $\beta_C$  $\beta_G$  $\beta_T$  $S^1_{base}$  $S^1_{beta}$  Acc. No.
V-RPXV        197731   0.332  0.168  0.167  0.333  0.952      1.015      1.006      0.974      0.998         0.992         NC_005858
V-RoMNPV      131526   0.302  0.195  0.196  0.307  0.995      0.967      1.036      0.951      0.994         0.971         NC_004323
V-RCMV        230138   0.194  0.301  0.309  0.196  0.786      0.836      0.716      0.761      0.990         0.953         NC_002512
V-SPPV        149955   0.381  0.123  0.127  0.369  0.951      0.917      0.938      0.956      0.984         0.993         NC_004002
V-SWSSV       305107   0.302  0.205  0.205  0.288  0.980      0.972      1.170      0.918      0.986         0.936         NC_003225
V-SpeiNPV     135611   0.283  0.217  0.221  0.278  0.937      1.012      0.991      0.966      0.991         0.987         NC_002169
V-SpltNPV     139342   0.281  0.213  0.215  0.291  0.953      0.906      0.941      0.968      0.988         0.987         NC_003102
V-SWPV        146454   0.366  0.136  0.138  0.360  0.958      0.964      0.985      0.956      0.992         0.994         NC_003389
V-TuHV-1      195859   0.166  0.327  0.340  0.168  0.966      0.901      0.850      0.956      0.985         0.983         NC_002794
V-VACV        191737   0.333  0.167  0.167  0.333  0.986      0.963      0.984      0.984      1.000         0.994         NC_001559
V-VARV        185578   0.338  0.164  0.163  0.334  0.945      1.019      0.998      0.963      0.995         0.990         NC_001611
V-XecnGV      178733   0.297  0.202  0.205  0.296  0.983      0.974      0.935      0.954      0.996         0.982         NC_002331
V-YMTV        134721   0.355  0.148  0.150  0.347  0.941      1.177      1.216      0.949      0.990         0.989         NC_005179
V-YLDV        144575   0.367  0.134  0.136  0.363  0.952      1.153      1.150      0.946      0.994         0.998         NC_002642



TABLE II: Information of Bacteria. The Bergey code is a shorthand for the lineage of the
organism in the order: phylum, class, order, family, genus. For species/strains (sp/str)
belonging to the fourteenth phylum, the subclass and suborder are also given. Items are ordered
by the Bergey code, so closely related species/strains are listed together.

Bergey code     Sp/str   Length    A%     C%     G%     T%     $\beta_A$  $\beta_C$  $\beta_G$  $\beta_T$  $S^1_{base}$  $S^1_{beta}$  Acc. No.
B.1.1.1.1.1     B-Aae    1551335   0.284  0.217  0.218  0.281  0.888      0.976      0.976      0.901      0.996         0.997         NC_000918
B.2.1.1.1.1     B-Tma    1860725   0.270  0.228  0.235  0.268  0.907      0.951      0.928      0.905      0.991         0.993         NC_000853
B.4.1.1.1.1     B-Dra1   2648638   0.165  0.335  0.335  0.165  1.118      0.982      0.962      1.053      1.000         0.979         NC_001263
                B-Dra2   412348    0.170  0.333  0.334  0.164  1.132      0.877      0.925      1.173      0.993         0.978         NC_001264
B.10.1          B-Tel    2593857   0.231  0.269  0.270  0.230  1.042      0.970      0.993      1.012      0.998         0.987         NC_004113
B.10.1.1.1.11   B-Pma1   1751080   0.319  0.182  0.182  0.317  0.971      0.938      1.037      0.978      0.998         0.973         NC_005042
                B-Pma2   1657990   0.345  0.155  0.153  0.347  0.958      0.928      1.062      0.957      0.996         0.965         NC_005072
                B-Pma3   2410873   0.256  0.262  0.245  0.236  0.974      0.884      0.934      0.977      0.963         0.986         NC_005071
B.10.1.1.1.13   B-SspW   2434428   0.202  0.297  0.297  0.204  1.077      0.843      0.822      1.085      0.998         0.992         NC_005070
B.10.1.1.1.14   B-SspP   3573470   0.261  0.238  0.239  0.262  1.055      0.938      0.927      1.068      0.998         0.994         NC_000911
B.10.1.4.1.8    B-Nsp    6413771   0.293  0.206  0.207  0.294  0.985      0.989      1.006      0.975      0.998         0.993         NC_003272
B.11.1.1.1.1    B-Cte    2154946   0.219  0.284  0.281  0.216  1.062      0.929      0.923      1.057      0.994         0.997         NC_002932
B.12.1.2.1.1    B-Rco    1268755   0.337  0.161  0.163  0.339  0.978      0.995      1.007      0.971      0.996         0.995         NC_003103
                B-Rpr    1111523   0.354  0.144  0.146  0.356  0.955      1.023      1.018      0.988      0.996         0.990         NC_000963
B.12.1.5.1.1    B-Ccr    4016947   0.165  0.337  0.335  0.163  0.986      0.994      0.976      1.030      0.996         0.984         NC_002696
B.12.1.6.1.2    B-Atu1   2841490   0.205  0.300  0.294  0.202  1.018      0.978      0.977      0.979      0.991         0.990         NC_003304
                B-Atu2   2075560   0.203  0.297  0.296  0.204  1.038      0.959      0.965      1.017      0.998         0.993         NC_003305
B.12.1.6.1.6    B-Sme1   3654135   0.186  0.315  0.312  0.186  0.975      0.971      0.971      1.002      0.997         0.993         NC_003047
                B-Sme2   1354226   0.200  0.303  0.301  0.197  0.971      0.992      0.988      0.973      0.995         0.998         NC_003037
                B-Sme3   1683333   0.188  0.311  0.313  0.188  0.991      0.975      0.991      0.974      0.998         0.992         NC_003078
B.12.1.6.3.1    B-Bme1   2117144   0.214  0.285  0.287  0.215  1.011      0.963      0.943      1.005      0.997         0.993         NC_003317
                B-Bme2   1177787   0.214  0.286  0.288  0.213  1.006      0.968      0.942      0.970      0.997         0.984         NC_003318
                B-Bsu1   2107792   0.214  0.287  0.285  0.214  1.005      0.946      0.965      1.006      0.998         0.995         NC_004310
                B-Bsu2   1207381   0.213  0.287  0.286  0.214  0.970      0.943      0.970      1.006      0.998         0.984         NC_004311
B.12.1.6.4.6    B-Mlo    7036074   0.186  0.316  0.311  0.186  1.000      0.973      0.970      0.992      0.995         0.997         NC_002678
B.12.1.6.7.1    B-Bja    9105828   0.180  0.320  0.320  0.180  0.967      0.986      0.980      0.969      1.000         0.998         NC_004463
B.12.2.1.2.1    B-Rso    3716413   0.164  0.333  0.337  0.166  1.032      0.946      0.948      1.062      0.994         0.992         NC_003295
B.12.2.1.4.3    B-Bbr    5339179   0.159  0.339  0.342  0.160  1.000      0.950      0.978      1.060      0.996         0.978         NC_002927
                B-Bpa    4773551   0.159  0.338  0.343  0.160  1.020      0.952      0.982      1.061      0.994         0.982         NC_002928
                B-Bpe    4086189   0.161  0.337  0.340  0.161  1.011      0.972      0.963      1.013      0.997         0.997         NC_002929
B.12.2.4.1.1    B-Nme1   2272351   0.242  0.256  0.260  0.243  0.969      0.917      0.950      0.944      0.995         0.985         NC_003112
                B-Nme2   2184406   0.240  0.259  0.259  0.242  0.970      0.934      0.944      0.962      0.998         0.995         NC_003116
B.12.2.4.1.5    B-Cvi    4751080   0.175  0.324  0.324  0.176  1.109      0.881      0.884      1.125      0.999         0.995         NC_005085
TABLE II: (Continued)

Bergey code     Sp/str  Length   A%     C%     G%     T%     β_A    β_C    β_G    β_T    S_base S_beta Acc. No.
B.12.3.3.1.1    B-Xax   5175554  0.176  0.324  0.323  0.176  1.037  0.958  0.951  1.071  0.999  0.990  NC_003919
                B-Xca   5076188  0.175  0.325  0.325  0.174  1.056  0.941  0.958  1.049  0.999  0.994  NC_003902
B.12.3.3.1.9    B-Xfa   2679306  0.225  0.249  0.277  0.248  0.962  1.062  1.051  0.946  0.949  0.993  NC_002488
B.12.3.6.2.1    B-Cbu   1995275  0.287  0.213  0.213  0.286  1.003  0.967  0.994  0.982  0.999  0.988  NC_002971
B.12.3.9.1.1    B-Pae   6264403  0.169  0.336  0.330  0.166  1.107  0.932  0.928  1.100  0.991  0.997  NC_002516
                B-Ppu   6181863  0.192  0.306  0.310  0.193  1.103  0.929  0.925  1.034  0.995  0.982  NC_002947
                B-Psy   6397126  0.208  0.292  0.292  0.208  1.027  0.935  0.928  1.010  1.000  0.994  NC_004578
B.12.3.10.1.7   B-Son   4969803  0.270  0.230  0.230  0.270  1.009  0.999  0.969  0.992  1.000  0.988  NC_004347
B.12.3.11.1.1   B-Vch1  2961149  0.260  0.238  0.239  0.263  1.040  0.919  0.961  1.033  0.996  0.988  NC_002505
                B-Vch2  1072315  0.265  0.233  0.236  0.266  1.008  0.920  0.956  0.927  0.996  0.969  NC_002506
                B-Vvu1  3281945  0.267  0.231  0.233  0.269  1.042  0.949  0.946  0.970  0.996  0.981  NC_004459
                B-Vvu2  1844853  0.264  0.236  0.235  0.265  0.999  0.935  0.938  1.001  0.998  0.999  NC_004460
                B-Vpa1  3288558  0.272  0.227  0.227  0.274  0.972  0.988  0.974  0.988  0.998  0.992  NC_004603
                B-Vpa2  1877212  0.272  0.227  0.226  0.274  0.983  0.982  0.938  0.979  0.997  0.988  NC_004605
B.12.3.13.1.5   B-BapS  641454   0.375  0.125  0.128  0.372  0.942  1.082  1.158  0.984  0.994  0.972  NC_004061
                B-Bsp   640681   0.371  0.131  0.132  0.366  0.954  1.085  1.151  0.978  0.994  0.978  NC_002528
                B-Bap   615980   0.371  0.127  0.127  0.375  0.922  1.016  1.056  0.970  0.996  0.978  NC_004545
B.12.3.13.1.13  B-Eco1  5231428  0.248  0.253  0.252  0.247  1.021  0.918  0.923  1.009  0.998  0.996  NC_004431
                B-Eco2  4639221  0.246  0.254  0.254  0.246  1.023  0.925  0.934  1.005  1.000  0.993  NC_000913
                B-Eco3  5498450  0.248  0.252  0.253  0.247  1.028  0.918  0.934  1.025  0.998  0.995  NC_002695
                B-Eco4  5528445  0.248  0.252  0.252  0.247  1.028  0.920  0.927  1.026  0.999  0.998  NC_002655
B.12.3.13.1.32  B-Sty1  4809037  0.239  0.260  0.261  0.240  1.014  0.922  0.929  1.010  0.998  0.997  NC_003198
                B-Sty2  4857432  0.239  0.261  0.261  0.239  1.024  0.928  0.925  1.003  1.000  0.994  NC_003197
                B-Sty3  4791961  0.239  0.260  0.261  0.240  1.025  0.912  0.930  1.015  0.998  0.993  NC_004631
B.12.3.13.1.34  B-Sfl   4607203  0.246  0.255  0.254  0.245  1.043  0.951  0.931  1.020  0.998  0.989  NC_004337
B.12.3.13.1.38  B-Wbr   697724   0.388  0.113  0.112  0.387  0.906  1.044  1.083  0.920  0.998  0.987  NC_004344
B.12.3.13.1.40  B-Ype1  4653728  0.262  0.237  0.239  0.262  1.024  0.937  0.944  1.018  0.998  0.997  NC_003143
                B-Ype2  4600755  0.261  0.237  0.239  0.263  1.023  0.952  0.930  1.021  0.996  0.994  NC_004088
B.12.3.14.1.1   B-Pmu   2257487  0.299  0.199  0.205  0.297  1.005  0.979  0.997  1.001  0.992  0.994  NC_002663
B.12.3.14.1.3   B-Hin   1830138  0.310  0.192  0.190  0.308  0.979  1.045  1.051  0.990  0.996  0.996  NC_000907
                B-Hdu   1698955  0.305  0.185  0.197  0.312  0.980  1.015  1.025  0.949  0.981  0.990  NC_002940
B.12.5.1.1.1    B-Cje   1641481  0.348  0.153  0.152  0.346  0.981  0.880  0.960  0.965  0.997  0.975  NC_002163
B.12.5.1.2.1    B-Hpy1  1667867  0.303  0.196  0.193  0.308  0.981  0.956  0.950  1.016  0.992  0.989  NC_000915
                B-Hpy2  1643831  0.303  0.197  0.195  0.305  0.983  0.957  0.948  0.996  0.996  0.994  NC_000921
                B-Hhe   1799146  0.322  0.182  0.177  0.319  0.996  0.946  0.931  0.997  0.992  0.996  NC_004917
B.12.5.1.2.2    B-Wsu   2110355  0.257  0.239  0.245  0.259  1.083  0.914  0.947  1.026  0.992  0.977  NC_005090
B.13.1.1.1.1    B-Cac   3940880  0.346  0.154  0.155  0.345  0.891  0.807  0.965  0.896  0.998  0.954  NC_003030
                B-Cpe   3031430  0.350  0.147  0.138  0.365  0.867  0.913  1.015  0.858  0.976  0.970  NC_003366
                B-Cte   2799251  0.353  0.146  0.141  0.359  0.870  0.952  0.925  0.830  0.989  0.981  NC_004557
B.13.1.2.1.8    B-Tte   2689445  0.312  0.188  0.188  0.313  0.938  0.876  0.912  0.930  0.999  0.988  NC_003869
B.13.2.1.1.1    B-Mge   580074   0.346  0.158  0.159  0.337  0.978  0.984  1.072  0.930  0.990  0.966  NC_000908
                B-Mpe   1358633  0.370  0.128  0.129  0.372  0.945  1.179  0.948  0.951  0.997  0.941  NC_004432
                B-Mpn   816394   0.305  0.200  0.201  0.295  0.990  1.017  0.998  0.970  0.989  0.990  NC_000912
                B-Mpu   963879   0.370  0.133  0.133  0.364  0.917  1.195  1.261  0.918  0.994  0.984  NC_002771
                B-Mga   996422   0.345  0.157  0.157  0.341  0.959  1.010  1.158  0.950  0.996  0.961  NC_004829
B.13.2.1.1.4    B-Uur   751719   0.373  0.126  0.129  0.372  0.954  0.993  1.130  0.944  0.996  0.963  NC_002162
B.13.3.1.1      B-Oih   3630528  0.321  0.179  0.178  0.322  0.943  0.924  1.009  0.934  0.998  0.975  NC_004193
B.13.3.1.1.1    B-Ban   5093554  0.323  0.178  0.175  0.325  0.939  0.865  0.890  0.951  0.995  0.990  NC_003995
                B-Bha   4202353  0.282  0.217  0.220  0.281  0.966  0.912  0.932  0.971  0.996  0.993  NC_002570
                B-Bsu   4214814  0.282  0.218  0.217  0.283  0.941  0.931  0.957  0.941  0.998  0.993  NC_000964
B.13.3.1.4.1    B-Lin   3011208  0.313  0.189  0.186  0.313  0.930  1.061  1.006  0.924  0.997  0.984  NC_003212
                B-Lmo   2944528  0.310  0.191  0.189  0.310  0.942  1.057  1.000  0.933  0.998  0.983  NC_003210
B.13.3.1.5.1    B-Sau1  2878040  0.335  0.164  0.165  0.337  0.927  1.069  0.999  0.956  0.997  0.975  NC_002758
                B-Sau2  2820462  0.334  0.164  0.164  0.338  0.924  1.068  1.028  0.959  0.996  0.981  NC_003923
                B-Sau3  2814816  0.334  0.164  0.164  0.337  0.929  1.073  1.003  0.959  0.997  0.975  NC_002745
                B-Sep   2499279  0.335  0.162  0.159  0.344  0.945  1.148  0.934  0.966  0.988  0.941  NC_004461
B.13.3.2.1.1    B-Lpl   3308274  0.277  0.223  0.222  0.278  0.986  0.882  0.907  0.980  0.998  0.992  NC_004567
B.13.3.2.6.1    B-Sag1  2160267  0.323  0.179  0.178  0.321  0.980  0.920  1.095  0.962  0.997  0.951  NC_004116
TABLE II: (Continued)

[Bergey code column as extracted; its row alignment is not recoverable: B.13.3.2.6; B.14.(1.5); 2; (1.7).1.1; B.14.(1.5).(1.7).4.1; B.14.(1.5).(1.11).1.1; B.14.(1.5); B.15.1.1.1; B.16.1.1.1; 2.1.; 4; .1; B.16.1.1.1.2; B.17.1.1.1; B.17.1.1.1; B.17.1.1.3; B.20.1.1.1.; B.20.1.1.3.; B.21.1.1.1.]

Sp/str  Length   A%     C%     G%     T%     β_A    β_C    β_G    β_T    S_base S_beta Acc. No.
Sag2    2211485  0.323  0.178  0.178  0.321  0.982  0.918  1.103  0.969  0.998  0.950  NC_004368
Smu     2030921  0.315  0.185  0.183  0.317  0.986  1.068  1.028  0.988  0.996  0.990  NC_004350
Spn1    2038615  0.302  0.198  0.199  0.301  0.993  0.905  0.890  0.999  0.998  0.994  NC_003098
Spn2    2160837  0.303  0.198  0.199  0.300  0.995  0.890  1.016  0.995  0.996  0.968  NC_003028
Spy1    1895017  0.307  0.192  0.193  0.308  0.970  0.970  1.036  0.979  0.998  0.981  NC_003485
Spy2    1900521  0.305  0.194  0.192  0.309  0.972  0.971  1.067  0.975  0.994  0.975  NC_004070
Spy3    1852441  0.309  0.191  0.194  0.306  0.977  0.973  1.034  0.976  0.994  0.984  NC_002737
Spy4    1894275  0.309  0.190  0.195  0.306  0.978  0.961  1.014  0.972  0.992  0.985  NC_004606
Lla     2365589  0.324  0.176  0.178  0.323  0.956  1.090  0.973  0.985  0.997  0.964  NC_002662
Cef     3147090  0.184  0.315  0.316  0.185  1.038  0.966  0.959  1.040  0.998  0.998  NC_004369
Cgl     3309401  0.231  0.270  0.268  0.231  0.953  0.980  1.003  0.960  0.998  0.992  NC_003450
Mle     3268203  0.210  0.287  0.291  0.212  0.971  0.959  0.957  1.005  0.994  0.991  NC_002677
Mtu1    4403836  0.172  0.329  0.327  0.172  0.962  1.134  1.159  0.975  0.998  0.991  NC_002755
Mtu2    4411529  0.172  0.329  0.327  0.172  0.964  1.133  1.163  0.974  0.998  0.991  NC_000962
Mbo     4345492  0.172  0.329  0.327  0.172  0.973  1.135  1.169  0.976  0.998  0.991  NC_002945
Sco     8667507  0.139  0.360  0.361  0.140  1.007  0.987  0.985  1.011  0.998  0.998  NC_003888
Sav     9025608  0.147  0.354  0.353  0.146  1.020  1.034  0.985  1.016  0.998  0.987  NC_003155
Blo     2256646  0.200  0.301  0.301  0.199  1.126  0.974  0.947  1.045  0.999  0.974  NC_004307
Psp     7145576  0.224  0.279  0.275  0.222  0.990  0.987  1.016  0.989  0.994  0.992  NC_005027
Cmu     1072950  0.299  0.201  0.202  0.298  0.976  0.899  0.957  0.990  0.998  0.981  NC_002620
Ctr     1042519  0.294  0.206  0.207  0.293  0.976  0.893  0.959  0.997  0.998  0.977  NC_000117
Cpn1    1229858  0.296  0.203  0.203  0.299  0.996  1.000  0.955  0.983  0.997  0.985  NC_002179
Cpn2    1230230  0.299  0.203  0.203  0.296  0.983  0.954  1.000  0.996  0.997  0.985  NC_000922
Cpn3    1226565  0.299  0.203  0.203  0.296  0.984  0.954  0.998  0.995  0.997  0.986  NC_002491
Cpn4    1225935  0.304  0.196  0.196  0.303  0.984  0.954  0.999  0.995  0.996  0.986  NC_005043
Cca     1173390  0.299  0.203  0.202  0.296  0.988  0.977  0.985  0.991  0.999  0.997  NC_003361
Bbu     910724   0.355  0.144  0.142  0.359  0.952  0.964  0.869  0.966  0.994  0.971  NC_001318
Tpa     1138011  0.235  0.262  0.266  0.237  0.974  0.903  0.898  0.959  0.994  0.995  NC_000919
Lin1    4332241  0.325  0.174  0.176  0.325  0.963  0.977  0.972  0.959  0.998  0.998  NC_004342
Lin2    358943   0.324  0.175  0.177  0.325  0.998  0.981  0.960  0.938  0.997  0.979  NC_004343
Bth     6260361  0.285  0.213  0.215  0.287  0.957  0.896  0.900  0.945  0.996  0.996  NC_004663
Pgi     2343476  0.258  0.241  0.242  0.259  1.010  0.925  0.934  1.012  0.998  0.997  NC_002950
Fnu     2174500  0.358  0.140  0.132  0.370  0.887  0.873  0.846  0.895  0.980  0.990  NC_003454
TABLE III: Information on Archaea. The Bergey code gives the lineage of the organism in the order: phylum, class, order, family, genus. Items are ordered by the Bergey code, so closely related species/strains are listed together.

Bergey code     Sp/str  Length   A%     C%     G%     T%     β_A    β_C    β_G    β_T    S_base S_beta Acc. No.
A.1.1.2.1.3     A-Ape   1669695  0.216  0.284  0.280  0.221  0.999  0.933  0.914  1.024  0.991  0.989  NC_000854
A.1.1.3.1.1     A-Sso   2992245  0.319  0.179  0.179  0.323  0.947  0.923  0.955  0.944  0.996  0.991  NC_002754
                A-Sto   2094700  0.334  0.163  0.165  0.338  0.900  1.037  0.944  0.950  0.994  0.972  NC_003106
A.2.1.1.1.1     A-Mth   1751377  0.251  0.247  0.248  0.254  0.905  0.936  0.939  0.986  0.990  0.992  NC_000916
A.2.6.1.1.1     A-Afu   2175400  0.258  0.242  0.244  0.256  0.907  0.910  0.885  0.992  0.996  0.987  NC_000917
A.2.3.1.1.1     A-Hsp   2014239  0.161  0.340  0.339  0.160  1.007  0.950  0.949  1.030  0.998  0.994  NC_002607
A.2.2.1.1.1     A-Mja   1004970  0.344  0.155  0.159  0.341  0.919  0.910  0.909  0.927  0.993  0.996  NC_000909
A.2.7.1.1.1     A-Mka   1594969  0.195  0.307  0.304  0.194  1.100  0.943  0.951  1.065  0.996  0.988  NC_003551
A.2.2.3.1.1     A-Mac   5751492  0.285  0.214  0.213  0.288  0.937  0.917  0.911  0.940  0.996  0.998  NC_003552
A.2.2.3.1.1     A-Mma   4096345  0.293  0.207  0.208  0.292  0.939  0.925  0.930  0.944  0.998  0.997  NC_003901
A.2.5.1.1.3     A-Pab   1765118  0.276  0.224  0.223  0.277  0.925  0.886  0.911  0.926  0.998  0.993  NC_000868
                A-Pfu   1908256  0.296  0.204  0.204  0.296  0.924  0.897  0.953  0.914  1.000  0.982  NC_003413
                A-Pho   1738505  0.290  0.212  0.207  0.291  0.938  0.967  0.922  0.933  0.994  0.984  NC_000961
A.2.4.1.1.1     A-Tac   1564906  0.272  0.229  0.231  0.268  0.949  0.947  0.954  0.944  0.994  0.997  NC_002578
                A-Tvo   1584804  0.302  0.199  0.200  0.299  0.941  0.946  0.950  0.965  0.996  0.993  NC_002689



TABLE IV: Information of the Saccharomyces cerevisiae genome.

Chromosome  Length   A%     C%     G%     T%     β_A    β_C    β_G    β_T    S_base S_beta
chr1        230203   0.303  0.194  0.199  0.304  0.957  1.072  1.034  0.924  0.994  0.982
chr2        813139   0.307  0.194  0.190  0.310  0.975  1.046  1.054  0.956  0.993  0.993
chr3        316613   0.312  0.197  0.188  0.303  0.942  1.248  1.025  0.960  0.982  0.942
chr4        1531929  0.311  0.189  0.190  0.310  0.994  1.129  1.070  0.951  0.998  0.975
chr5        576869   0.306  0.190  0.195  0.309  0.941  1.024  1.030  0.969  0.992  0.991
chr6        270148   0.307  0.193  0.194  0.306  0.948  1.128  0.980  1.007  0.998  0.949
chr7        1090937  0.310  0.190  0.190  0.309  0.955  1.013  1.011  0.962  0.999  0.998
chr8        562639   0.309  0.194  0.191  0.306  0.990  1.024  1.162  0.959  0.994  0.959
chr9        439885   0.305  0.194  0.195  0.306  0.973  1.006  1.161  0.984  0.998  0.960
chr10       745444   0.310  0.191  0.193  0.306  0.971  1.024  1.128  0.974  0.994  0.974
chr11       666445   0.309  0.192  0.189  0.310  0.972  1.030  1.113  0.968  0.996  0.979
chr12       1078173  0.307  0.193  0.192  0.309  0.966  1.090  1.099  0.971  0.997  0.997
chr13       924430   0.310  0.191  0.191  0.308  0.963  1.050  1.044  0.948  0.998  0.995
chr14       784328   0.308  0.193  0.193  0.306  0.984  1.072  1.200  0.975  0.998  0.968
chr15       1091284  0.311  0.192  0.190  0.307  0.972  1.074  1.054  0.958  0.994  0.992
chr16       948061   0.310  0.190  0.190  0.309  0.950  1.025  0.986  0.987  0.999  0.981
TABLE V: Information of the Homo sapiens genome. The contigs used for analysis are listed in the "Contig" column, together with their lengths.

Chromosome  Contig  Length     A%     C%     G%     T%     β_A    β_C    β_G    β_T    S_base S_beta
chr1        30      36790572   0.293  0.207  0.207  0.293  0.972  1.148  1.080  0.971  1.000  0.983
chr2        5       84213153   0.306  0.193  0.194  0.307  0.962  1.073  1.080  0.965  0.998  0.998
chr3        2       100530261  0.305  0.195  0.195  0.305  0.962  1.198  1.097  0.959  1.000  0.975
chr4        8       62915881   0.314  0.185  0.186  0.315  0.945  1.098  1.070  0.942  0.998  0.992
chr5        8       41199371   0.307  0.193  0.192  0.307  0.947  1.023  1.037  0.946  0.999  0.996
chr6        6       61695806   0.308  0.192  0.191  0.309  0.966  1.080  1.083  0.964  0.998  0.999
chr7        5       64412912   0.304  0.196  0.196  0.304  0.952  1.094  1.104  0.950  1.000  0.997
chr8        2       48689376   0.306  0.194  0.194  0.305  0.969  1.082  1.074  0.972  0.999  0.997
chr9        1       39435726   0.306  0.195  0.194  0.305  0.952  1.102  1.045  0.956  0.998  0.985
chr10       9       43027086   0.292  0.207  0.207  0.294  0.982  1.077  1.092  0.971  0.998  0.994
chr11       2       48854501   0.296  0.204  0.204  0.297  0.954  1.053  1.052  0.957  0.999  0.999
chr12       8       38627316   0.301  0.200  0.200  0.300  0.955  1.014  1.024  0.948  0.999  0.996
chr13       3       67740325   0.310  0.191  0.190  0.309  0.955  1.083  1.119  0.951  0.998  0.990
chr14       1       87191216   0.294  0.204  0.205  0.297  0.955  1.154  1.113  0.956  0.996  0.990
chr15       2       22003156   0.291  0.211  0.210  0.288  0.958  1.114  1.085  0.983  0.996  0.987
chr16       1       53619965   0.289  0.211  0.211  0.289  0.983  1.091  1.068  0.981  1.000  0.994
chr17       5       24793602   0.282  0.218  0.218  0.283  0.951  1.075  1.175  0.962  0.999  0.973
chr18       3       33548238   0.303  0.197  0.197  0.302  0.981  1.075  1.041  0.984  0.999  0.991
chr19       1       31383029   0.262  0.237  0.238  0.263  0.971  1.067  1.060  0.959  0.998  0.995
chr20       3       26259569   0.289  0.209  0.209  0.293  0.992  1.093  1.061  0.980  0.996  0.989
chr21       1       28602116   0.306  0.196  0.195  0.303  0.971  1.128  1.289  0.964  0.996  0.961
chr22       3       23178213   0.263  0.237  0.237  0.263  0.968  1.050  1.021  0.993  1.000  0.987
chrX        9       32736268   0.304  0.195  0.195  0.307  0.973  1.205  1.167  0.969  0.997  0.990
chrY        1       9938763    0.304  0.194  0.197  0.305  0.925  1.199  1.152  0.913  0.996  0.986


TABLE VI: Information of the random sequence and of the genome simulated by the minimal model.

Chromosome  Length    A%     C%     G%     T%     β_A    β_C    β_G    β_T    S_base S_beta
Random      10000000  0.250  0.250  0.250  0.250  0.994  0.992  0.993  0.992  1.000  0.999
Simulation  1028001   0.270  0.242  0.249  0.239  1.004  0.986  0.979  0.988  0.962  0.994
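
As a rough illustration of the composition columns (A%, C%, G%, T%) reported for the "Random" entry of Table VI, the following sketch draws a uniformly random base sequence and computes its base fractions. The function names, the seed, and the sequence length are illustrative assumptions and are not taken from the analysis itself; the sketch does not compute the β or S columns, and it does not reproduce the minimal model.

```python
import random
from collections import Counter

BASES = "ACGT"


def random_sequence(length, seed=0):
    """Draw `length` bases independently and uniformly from A, C, G, T.

    The seed is an arbitrary choice made only for reproducibility.
    """
    rng = random.Random(seed)
    return "".join(rng.choices(BASES, k=length))


def base_fractions(sequence):
    """Return the fraction of each base among the A/C/G/T symbols of `sequence`."""
    counts = Counter(sequence)
    total = sum(counts[base] for base in BASES)
    if total == 0:
        raise ValueError("sequence contains no A, C, G or T symbols")
    return {base: counts[base] / total for base in BASES}


if __name__ == "__main__":
    # A shorter sequence than the 10^7 bases of Table VI keeps the run fast.
    fractions = base_fractions(random_sequence(1_000_000))
    for base in BASES:
        print(f"{base}%: {fractions[base]:.3f}")
```

For a sequence of this length each fraction should fall close to 0.250, in line with the "Random" row; a real genome would be read from a sequence file instead of being generated.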



